Extraction of Phrase-based Concepts in Vulnerability Descriptions through Unsupervised Labeling


Abstract

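Software vulnerabilities, once disclosed, can be documented in vulnerability databases, which have great potential to advance vulnerability analysis and security research. People describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for the automatic analysis of vulnerabilities and of the knowledge embedded in text. Automatic extraction of key vulnerability aspects is highly desirable but demands significant effort to manually label data for model training. In this paper, we propose unsupervised methods to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on six types of phrase-based vulnerability concepts (vulnerability type, vulnerable component, root cause, attacker type, impact, attack vector), as they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product, version). Our approach is based on a key observation that same-type phrases, no matter how they differ in sentence structures and phrase expressions, usually share syntactically similar paths in the sentence parsing trees. Specifically, we present a source-target neural network architecture that learns Part-of-Speech (POS) tagging to identify a token’s functional role within TVDs, where the source neural network is trained to capture common features found in the TVD corpus, and the target neural network is trained to identify linguistically malformed words specific to the security domain. Our evaluation confirms that the proposed tagger outperforms (4.45%–5.98%) the taggers designed on natural language notions and identifies a broad set of TVDs and natural language contents. Then, based on the key observations, we propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode such syntactic similarities. To address the discrete nature of our paths, we enhance the traditional Variational Auto-encoder (VAE) with the Gumbel-Max trick for categorical data distribution and thus create a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we further apply unsupervised clustering techniques to generate clusters of same-type concepts. Our evaluation confirms the effectiveness of CaVAE, which achieves a small (85.85) log-likelihood for encoding path representations and the accuracy (83%–89%) of vulnerability concepts in the resulting clusters. The resulting clusters accurately label the vulnerability concepts from a common TVD corpus in an unsupervised way. Furthermore, these unsupervisedly labeled vulnerability concepts can be mapped back to the corresponding phrases in the original TVDs, which produces labels of vulnerability phrases. The resulting phrase labels can be used to train concept extraction models for other TVD corpora. In this work, we identify two concept extraction methods (concept classification and sequence labeling model) to demonstrate the utility of the unsupervisedly labeled concepts. Our study shows that models trained with our unsupervisedly labeled vulnerability concepts outperform (3.9%–5.14%) those trained with the two manually labeled TVD datasets from previous work, due to the consistent boundary and typing by our unsupervised labeling method.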
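The abstract names the Gumbel-Max trick as the device that lets a VAE handle discrete (categorical) path symbols. The sketch below is only a generic illustration of that trick, not the authors' CaVAE: the function names, the example probabilities, and the temperature-controlled softmax relaxation are all assumptions introduced here, since the abstract does not describe how the trick is wired into the model.

```python
import math
import random
from collections import Counter


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(log_probs, rng):
    """Gumbel-Max trick: argmax_i(log p_i + g_i) is a draw from Categorical(p)."""
    perturbed = [lp + sample_gumbel(rng) for lp in log_probs]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


def gumbel_softmax(log_probs, temperature, rng):
    """Assumed relaxation: replace the argmax with a softmax at a temperature.

    Lower temperatures push the output closer to a one-hot vector.
    """
    scores = [(lp + sample_gumbel(rng)) / temperature for lp in log_probs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_frequencies(probs, n_draws, seed=0):
    """Estimate category frequencies from repeated Gumbel-Max draws."""
    rng = random.Random(seed)
    log_probs = [math.log(p) for p in probs]
    counts = Counter(gumbel_max_sample(log_probs, rng) for _ in range(n_draws))
    return [counts[i] / n_draws for i in range(len(probs))]


if __name__ == "__main__":
    # Hypothetical three-way distribution, chosen only for illustration.
    print(empirical_frequencies([0.2, 0.5, 0.3], n_draws=20000))
    print(gumbel_softmax([math.log(p) for p in (0.2, 0.5, 0.3)], 0.5, random.Random(1)))
```

With enough draws, the empirical frequencies should approach the input probabilities, which is the property that makes the perturb-and-argmax step a valid categorical sampler. The softmax variant is the usual way to keep such a sampler differentiable during training; whether CaVAE uses exactly this form is not stated in the abstract.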


Similar resources

An Unsupervised Model for Joint Phrase Alignment and Extraction

We present an unsupervised model for joint phrase alignment and extraction using nonparametric Bayesian methods and inversion transduction grammars (ITGs). The key contribution is that phrases of many granularities are included directly in the model through the use of a novel formulation that memorizes phrases generated not only by terminal, but also non-terminal symbols. This allows for a comp...


Extraction and 3D Segmentation of Tumors-Based Unsupervised Clustering Techniques in Medical Images

Introduction: The diagnosis and separation of cancerous tumors in medical images require accuracy, experience, and time, and have always been a major challenge for radiologists and physicians. Materials and Methods: We received 290 medical images composed of 120 mammographic images (LJPEG format, scanned in gray-scale at 50-micron size), 110 MRI images including T1-weighted, T...


Corpus Based Unsupervised Labeling of Documents

Text categorization involves mapping of documents to a fixed set of labels. A similar but equally important problem is that of assigning labels to large corpora. With a deluge of documents from sources like the World Wide Web, manual labeling by domain experts is prohibitively expensive. The problem of reducing effort in labeling of documents has warranted a lot of investigation in the past. Mo...


Hierarchical Phrase-Based Grammar Extraction in Joshua

While example-based machine translation has long used corpus information at run-time, statistical phrase-based approaches typically include a preprocessing stage where an aligned parallel corpus is split into phrases, and parameter values are calculated for each phrase using simple relative frequency estimates. This paper describes an open source implementation of the crucial algorithms present...


Unsupervised Knowledge Extraction for Taxonomies of Concepts from Wikipedia

A novel method for unsupervised acquisition of knowledge for taxonomies of concepts from raw Wikipedia text is presented. We assume that the concepts classified under the same node in a taxonomy are described in a comparable way in Wikipedia. The concepts in 6 taxonomies extracted from WordNet are mapped onto Wikipedia pages and the lexico-syntactic patterns describing semantic structures expre...



Journal

Journal title: ACM Transactions on Software Engineering and Methodology

Year: 2023

ISSN: 1049-331X, 1557-7392

DOI: https://doi.org/10.1145/3579638